Cosmos: Fix 410/1002 PartitionKeyRangeGone surfacing on query/change-feed paths during split/merge #49436
Open
NaluTripician wants to merge 2 commits into
…ring split/merge The query-path PartitionKeyRangeGoneRetryPolicy retried 410/1002 only once and ignored 410/1007 (CompletingSplitOrMerge) and 410/1008 (CompletingPartitionMigration), surfacing transient 410s to query callers during a partition split/merge. It now refreshes the routing map and retries those sub-statuses up to 10 times (using an AtomicInteger counter), matching the bulk/transactional-batch retry policies. Mirrors the .NET SDK fix (Azure/azure-cosmos-dotnet-v3 PR #5941). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ed paths Self-review noted PartitionKeyRangeGoneRetryPolicy is also used by the change-feed reader path (ChangeFeedFetcher), not only queries. Clarify the changelog scope accordingly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR updates the Cosmos DB Java SDK’s internal PartitionKeyRangeGoneRetryPolicy to better handle transient 410 (Gone) responses during partition topology transitions (split/merge/migration) on query and change-feed request paths, aligning behavior with other retry policies in the SDK.
Changes:
- Replaced one-shot retry (`volatile boolean retried`) with a bounded retry budget using `AtomicInteger` (max 10 retries).
- Expanded the handled `410` sub-status codes from only `1002` to also include `1007` and `1008`, forcing routing-map refresh and retrying immediately (`Duration.ZERO`).
- Updated `CHANGELOG.md` to describe the corrected behavior for query and change-feed paths.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/PartitionKeyRangeGoneRetryPolicy.java | Adds bounded retry budget and broadens 410 sub-status handling with per-attempt routing-map refresh. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the fix and the expanded 410 sub-status retry behavior. |
Comment on lines +103 to +104

    return refreshedRoutingMapObs.flatMap(rm ->
        Mono.just(ShouldRetryResult.retryAfter(Duration.ZERO)));

Comment on lines +29 to +30

    private static final int MAX_RETRY_COUNT = 10;
    private final AtomicInteger retryCount = new AtomicInteger(0);

#### Breaking Changes

#### Bugs Fixed
* Fixed transient `410/1002` (`PartitionKeyRangeGone`) errors surfacing to callers during a partition split or merge. The `PartitionKeyRangeGoneRetryPolicy` (used on the query and change-feed paths) previously retried only once and ignored the in-progress `410/1007` (`CompletingSplitOrMerge`) and `410/1008` (`CompletingPartitionMigration`) sub-status codes; it now refreshes the routing map and retries those sub-statuses up to 10 times before surfacing the error.
Description
The `PartitionKeyRangeGoneRetryPolicy` previously:
- retried `410/1002` (`PartitionKeyRangeGone`) only once (`volatile boolean retried`), and
- ignored `410/1007` (`CompletingSplitOrMerge`) and `410/1008` (`CompletingPartitionMigration`); they fell through to the next policy.

As a result, during a slow or in-progress partition split or merge, callers could see a transient `410` surfaced as a hard error. The bulk (`BulkOperationRetryPolicy`) and transactional-batch (`TransactionalBatchRetryPolicy`) policies already handle all three sub-statuses with a bounded retry budget; this policy was the outlier.

Fix
- Replaced the one-shot `boolean retried` with a bounded `AtomicInteger retryCount` (`MAX_RETRY_COUNT = 10`), matching the `AtomicInteger` pattern already used by `GoneAndRetryWithRetryPolicy`.
- The policy now handles `PARTITION_KEY_RANGE_GONE` + `COMPLETING_SPLIT_OR_MERGE` + `COMPLETING_PARTITION_MIGRATION`, force-refreshing the routing map per attempt and returning `retryAfter(Duration.ZERO)`.

This mirrors the .NET SDK fix in Azure/azure-cosmos-dotnet-v3 #5941.
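For reviewers who want the decision logic at a glance, here is a minimal plain-Java sketch of the behavior described above. It is a simplified model, not the SDK class: the class name `PartitionKeyRangeGoneRetrySketch`, the `Decision` enum, the `decide` method, and the `Runnable` standing in for the routing-map refresh are all hypothetical, and the reactive `Mono` plumbing is omitted. `RETRY` stands for `retryAfter(Duration.ZERO)`, and `DELEGATE` stands for handing the exception to `nextRetryPolicy`.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified, hypothetical model of the retry decision; NOT the SDK class.
class PartitionKeyRangeGoneRetrySketch {
    // RETRY ~ retryAfter(Duration.ZERO); DELEGATE ~ hand over to nextRetryPolicy.
    enum Decision { RETRY, SURFACE_ERROR, DELEGATE }

    static final int GONE = 410;
    static final int MAX_RETRY_COUNT = 10;

    // Sub-status codes named in the PR description.
    static final Set<Integer> RETRIABLE_SUB_STATUSES = Set.of(
        1002,  // PartitionKeyRangeGone
        1007,  // CompletingSplitOrMerge
        1008); // CompletingPartitionMigration

    private final AtomicInteger retryCount = new AtomicInteger(0);
    // Assumption: stand-in for the forced routing-map refresh.
    private final Runnable forceRefreshRoutingMap;

    PartitionKeyRangeGoneRetrySketch(Runnable forceRefreshRoutingMap) {
        this.forceRefreshRoutingMap = forceRefreshRoutingMap;
    }

    Decision decide(int statusCode, int subStatusCode) {
        if (statusCode != GONE || !RETRIABLE_SUB_STATUSES.contains(subStatusCode)) {
            return Decision.DELEGATE; // not ours: does not consume the budget
        }
        // Prior values 0..9 are allowed to retry; the 11th failure is surfaced.
        if (retryCount.getAndIncrement() >= MAX_RETRY_COUNT) {
            return Decision.SURFACE_ERROR;
        }
        forceRefreshRoutingMap.run(); // refresh on every retried attempt
        return Decision.RETRY;
    }

    public static void main(String[] args) {
        AtomicInteger refreshes = new AtomicInteger();
        PartitionKeyRangeGoneRetrySketch policy =
            new PartitionKeyRangeGoneRetrySketch(refreshes::incrementAndGet);
        for (int failure = 1; failure <= 11; failure++) {
            System.out.println("failure " + failure + ": " + policy.decide(GONE, 1007));
        }
        System.out.println("routing-map refreshes: " + refreshes.get());
    }
}
```

In this model the budget is shared across the three sub-statuses, which is how the description reads (a single `retryCount` per policy instance); whether the real class behaves identically in every edge case should be confirmed against the diff.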
Scope note for reviewers
`PartitionKeyRangeGoneRetryPolicy` is constructed on two paths: the query path (`DefaultDocumentQueryExecutionContext`) and the change-feed reader path (`ChangeFeedFetcher`). This change therefore affects change-feed split/merge handling as well; please consciously sign off on that. The changelog wording has been broadened to "query and change-feed paths" to reflect this.

Self-review (no double-handling)
A deep self-review confirmed that the query/change-feed policy and the transport-layer `GoneAndRetryWithRetryPolicy` (point reads/writes via `ReplicatedResourceClient`) operate on mutually exclusive code paths, so adding 1007/1008 here does not double-handle. Retry-count semantics verified: `getAndIncrement()` yields exactly 10 retries (prior values 0–9) then surfaces on the 11th. Delegation to `nextRetryPolicy` for non-matching exceptions is preserved; the reactive `Mono` flow is intact.

CI
All build/test jobs pass. The single red leaf job, `Test Emulator windows2022_Spark35Scala213IntegrationTests…Java17` → `ChangeFeedPartitionReaderITest "should honor endLSN during split and should hang"`, is a timing flake, not caused by this change: the same test passed in the sibling Spark 3.5 / Scala 2.12 job in the same build (the Java SDK class under review is byte-identical across those two jobs). The assertion (`future.isCompleted shouldEqual true` after a fixed sleep + poll) is inherently racy under emulator ingestion timing.

This change alters the behavior of a previously untested class (it still carried `// TODO: this need testing`). Unit tests should be added before merge. I could not run the Maven build locally, so the existing coverage relies on CI. Recommended `StepVerifier` cases:
- `410/1002`, `410/1007`, `410/1008` each → `retryAfter(ZERO)` + routing-map force-refresh invoked.
- 11th retriable failure → `ShouldRetryResult.error(...)` (locks the 10-retry boundary).
- Non-matching exception → `nextRetryPolicy.shouldRetry` exactly once.
- `tryLookupAsync` → no NPE, returns retry.

🤖 Generated via the Seon workflow (cross-SDK port of the .NET 410/1002 fix; mandatory self-review + CI adjudication applied).
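As a follow-up on the retry-count semantics, the snippet below illustrates why the 10-retry boundary is worth locking with a test. It is a standalone illustration using only `java.util.concurrent.atomic.AtomicInteger`; the class `RetryBudgetBoundary` and the method `firstRejectedAttempt` are hypothetical names, and the `< MAX_RETRY_COUNT` comparison is an assumption about how the budget check is written. With `getAndIncrement()` the check sees prior values 0–9 and rejects the 11th failure; swapping in `incrementAndGet()` would silently shrink the budget to 9 retries.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of the counter boundary; not SDK code.
class RetryBudgetBoundary {
    static final int MAX_RETRY_COUNT = 10;

    /**
     * Returns the 1-based number of the first failure that is NOT retried.
     * incrementFirst = false models getAndIncrement() (value before increment);
     * incrementFirst = true models incrementAndGet() (value after increment).
     */
    static int firstRejectedAttempt(boolean incrementFirst) {
        AtomicInteger retryCount = new AtomicInteger(0);
        int failure = 1;
        while ((incrementFirst ? retryCount.incrementAndGet()
                               : retryCount.getAndIncrement()) < MAX_RETRY_COUNT) {
            failure++; // this failure was retried; move on to the next one
        }
        return failure;
    }

    public static void main(String[] args) {
        System.out.println(firstRejectedAttempt(false)); // prints 11 (10 retries)
        System.out.println(firstRejectedAttempt(true));  // prints 10 (9 retries)
    }
}
```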